
[Feature] Atomic Reduction Operations and Vectorization Enhancement#1676

Merged
LeiWang1999 merged 15 commits into tile-ai:main from LeiWang1999:atomic_0115 on Jan 15, 2026

Conversation

LeiWang1999 (Member) commented Jan 15, 2026

Summary

This PR introduces atomic reduction operations (AtomicMax, AtomicMin) and enhances the existing atomic add functionality with automatic vectorization support and code architecture improvements.

Key Changes

  • New Atomic Reduction Operations: Added T.atomic_max and T.atomic_min for both element-wise and tile-region-based operations
  • Atomic Add Vectorization: Automatic vectorization of atomic_add_elem_op to atomic_addx2_elem_op (for float16/bfloat16) or atomic_addx4_elem_op (for float32 on SM>=90)
  • Code Refactoring: Refactored AtomicAddNode to inherit from AtomicOpBaseNode<T> using CRTP pattern, reducing code duplication
  • Return Previous Value: Support for return_prev=True parameter to retrieve the previous value in atomic operations
  • Naming Consistency: Fixed naming convention for atomic ops with return value (atomic_add_ret_elem_op, atomic_max_ret_elem_op, atomic_min_ret_elem_op)

New APIs

# Atomic Max - element-wise
T.atomic_max(dst[i, j], value)

# Atomic Max - tile-region based
T.atomic_max(dst_tensor, src_tensor)

# Atomic Min - element-wise
T.atomic_min(dst[i, j], value)

# Atomic Min - tile-region based  
T.atomic_min(dst_tensor, src_tensor)

# Atomic Add with return previous value
old_val = T.atomic_add(dst[i, j], value, return_prev=True)

# Memory ordering support
T.atomic_add(dst[i, j], value, memory_order="relaxed")

Example Usage

import tilelang
import tilelang.language as T


@tilelang.jit
def atomic_max_kernel(K, M, N, block_M, block_N, dtype=T.float32):
    @T.prim_func
    def atomic_max(A: T.Tensor((K, M, N), dtype), B: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(M, block_M), T.ceildiv(N, block_N), K, threads=32) as (bx, by, bz):
            A_shared = T.alloc_shared((block_M, block_N), dtype)
            T.copy(A[bz, bx * block_M : (bx + 1) * block_M, by * block_N : (by + 1) * block_N], A_shared)
            
            for i, j in T.Parallel(block_M, block_N):
                T.atomic_max(B[bx * block_M + i, by * block_N + j], A_shared[i, j])
    return atomic_max

Vectorization Pass

The new vectorization pass automatically detects ForKind::kVectorized loops containing atomic_add_elem_op and converts them to vectorized versions (a standalone sketch of the width selection follows the list):

  • float16/bfloat16: Uses atomic_addx2_elem_op (2 elements per operation)
  • float32 on SM>=90: Uses atomic_addx4_elem_op (4 elements per operation)
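
As a standalone illustration of that width selection (the enum and the plain integer SM check below are stand-ins for TVM's DataType and the TargetHasSMVersionGE utility; the real rewriter lives in src/transform/atomicadd_vectorize.cc):

#include <iostream>

// Stand-ins for the real dtype and target checks; names here are illustrative only.
enum class DType { Float16, BFloat16, Float32, Other };

int PickAtomicAddVectorWidth(DType dtype, int sm_arch) {
  if (dtype == DType::Float16 || dtype == DType::BFloat16)
    return 2;  // lower to atomic_addx2_elem_op
  if (dtype == DType::Float32 && sm_arch >= 90)
    return 4;  // lower to atomic_addx4_elem_op
  return 1;    // keep the scalar atomic_add_elem_op
}

int main() {
  std::cout << PickAtomicAddVectorWidth(DType::Float16, 80) << "\n";  // 2
  std::cout << PickAtomicAddVectorWidth(DType::Float32, 90) << "\n";  // 4
  std::cout << PickAtomicAddVectorWidth(DType::Float32, 80) << "\n";  // 1
  return 0;
}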

Architecture Improvements

  • Introduced AtomicOpBaseNode<Derived> template base class using CRTP pattern
  • Shared implementation for MakeIterVars, MakeIndices, MakePredicate, MakeSIMTLoop, Lower, and InferLayout
  • AtomicAddNode, AtomicMaxNode, and AtomicMinNode all inherit from this base class (a self-contained illustration of the dispatch pattern follows)
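
The toy program below illustrates the CRTP dispatch in isolation; the struct and method names echo the PR, but the string payloads and the trivial Lower body stand in for the real TL ops and lowering logic:

#include <iostream>
#include <string>

// Shared logic lives once in the base; each derived type only supplies its
// element-wise intrinsic via GetElemOp(), mirroring AtomicOpBaseNode<Derived>.
template <typename Derived>
struct AtomicOpBase {
  void Lower() const {
    std::cout << "lower with " << static_cast<const Derived *>(this)->GetElemOp()
              << "\n";
  }
};

struct AtomicMaxNode : AtomicOpBase<AtomicMaxNode> {
  std::string GetElemOp() const { return "tl.atomic_max_elem_op"; }
};

struct AtomicMinNode : AtomicOpBase<AtomicMinNode> {
  std::string GetElemOp() const { return "tl.atomic_min_elem_op"; }
};

int main() {
  AtomicMaxNode{}.Lower();  // prints: lower with tl.atomic_max_elem_op
  AtomicMinNode{}.Lower();  // prints: lower with tl.atomic_min_elem_op
  return 0;
}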

Test plan

  • test_atomic_add - Basic atomic add
  • test_atomic_max - Atomic maximum
  • test_atomic_min - Atomic minimum
  • test_atomic_load_store - Atomic load/store operations
  • test_atomic_memory_order - Memory ordering support
  • test_atomic_addx2_half - Vectorized atomic add for float16
  • test_atomic_addx2_float - Vectorized atomic add for float32
  • test_atomic_addx4 - 4-wide vectorized atomic add
  • test_atomic_return_prev - Return previous value
  • test_tile_atomic_add - Tile-region atomic add
  • test_tile_atomic_max - Tile-region atomic max
  • test_tile_atomic_min - Tile-region atomic min
  • test_tma_atomic_add - TMA-based atomic add

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Tile-level atomic max/min and tensor-region atomic-add with optional return-value and 2x/4x vectorized variants.
  • Vectorization

    • Target- and layout-aware vectorization for atomic ops, choosing vector widths based on data type and device capability.
  • Codegen

    • CUDA and HIP backends emit the new atomic intrinsics (returning and non-returning variants) and honor optional memory-order hints.
  • Tests

    • Added tests exercising tile-level atomic max, min, and atomic-add paths.


…d functionality

- Added support for atomic max and min operations with corresponding classes and methods.
- Introduced vectorization capabilities for atomic add operations based on data type and target architecture.
- Refactored atomic add intrinsic calls to improve clarity and consistency.
- Enhanced layout inference for atomic operations and integrated new utility functions for target architecture checks.
- Updated tests to validate the new atomic operations and their vectorization behavior.
- Cleaned up whitespace and formatting in atomic add and reduce implementations for better code clarity.
- Enhanced comments in CUDA code generation for atomic operations to improve understanding.
- Updated function signatures and layout definitions for consistency across atomic operation files.
- Ensured proper alignment of code for better maintainability and readability.
@github-actions commented

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Jan 15, 2026

Caution

Review failed

The pull request is closed.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Refactors atomic ops and vectorization: adds an AtomicOpBaseNode and AtomicMax/AtomicMin, reworks AtomicAdd lowering to use ParallelOp + multi-level layout inference, replaces planner-driven atomic_add vectorization with an inline rewriter, threads LayoutMap through vectorize/lowering, adds TL builtins and CUDA/HIP codegen, and updates tests.

Changes

Cohort / File(s) Summary
Atomic base & new ops
src/op/atomic_reduce.h, src/op/atomic_reduce.cc
Add AtomicOpBaseNode (CRTP) with shared helpers (MakeIterVars/MakeIndices/MakePredicate/MakeSIMTLoop/InferLayout/Lower); implement AtomicMax/AtomicMin, Clone/GetElemOp, register TL ops, and instantiate templates.
AtomicAdd refactor
src/op/atomic_add.h, src/op/atomic_add.cc
Make AtomicAddNode derive from AtomicOpBaseNode; remove legacy planner scaffolding; add GetElemOp() and GetVectorizeLength(Target); Lower now builds a ParallelOp, runs multi-level layout inference, then calls LowerParallelLoop.
Builtins: intrinsics added/renamed
src/op/builtin.h, src/op/builtin.cc
Rename atomicadd_elem_op to atomic_add_elem_op; add atomic_add_ret_elem_op, atomic_addx2/x4, atomic_load/store, atomic_max/min and _ret variants; set arities and kOpaque effects.
AtomicAdd vectorize rewriter
src/transform/atomicadd_vectorize.h, src/transform/atomicadd_vectorize.cc
Replace planner with AtomicAddVectorizeRewriter, add ExtractBufferLoad and GetVectorizedAtomicOp; change VectorizeAtomicAdd signature and rewrite loop bodies to emit vectorized atomics when applicable.
Vectorization plumbing
src/transform/loop_vectorize.h, src/transform/loop_vectorize.cc
Thread LayoutMap through GetVectorizeSize/VectorizeLoop/VectorizePlanner; add IsBufferContiguous; apply layout-aware index transforms and SM/version-aware vector-width decisions.
Lowering / loop partition updates
src/transform/loop_partition.h, src/transform/loop_partition.cc, src/transform/lower_tile_op.cc
Extend LowerParallelLoop to accept const LayoutMap &layout_map; invoke VectorizeAtomicAdd during lowering; propagate layout_map_ from LowerTileOpPass.
Parallel / target utilities
src/op/parallel.cc, src/target/utils.h, src/target/utils.cc
Use IsSharedBuffer/IsGlobalBuffer for scope checks; change vector-size computation to GetVectorizeSize(..., layout_map); add TargetHasSMVersionGE(Target,int).
CUDA / HIP codegen
src/target/codegen_cuda.cc, src/target/codegen_hip.cc
Emit new TL intrinsics: atomic_add variants (including x2/x4 and _ret), atomic_load/store, atomic_max/min (and _ret), handling optional memory_order and return-value variants.
Language layer & tests
tilelang/language/atomic.py, testing/python/language/test_tilelang_language_atomic.py
Language: add tensor-region (tileop) and scalar paths using TL intrinsics and memory_order; tests: add tile-level atomic_max/atomic_min programs, run helpers, and kernel-source prints.
Call-site updates
src/op/copy.cc, src/op/fill.cc, src/transform/loop_vectorize.cc, src/transform/loop_partition.cc, src/transform/lower_tile_op.cc
Thread layout_map through VectorizeLoop/GetVectorizeSize/LowerParallelLoop call sites and adjust lowering order to include atomic-add vectorization.
Legalize / safety
src/transform/legalize_safe_memory_access.cc, testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py
Treat atomic ops like extern calls for recursive condition collection; update test to use in-language T.atomic_add instead of direct extern handle call.

Sequence Diagram(s)

sequenceDiagram
    participant User as User Code
    participant Lang as tilelang (atomic.py)
    participant TileOp as TL TileOp / ElemOp
    participant Infer as ParallelOp InferLayout
    participant Lower as LowerParallelLoop
    participant Vect as AtomicAddVectorizeRewriter
    participant Codegen as CUDA/HIP Codegen

    User->>Lang: emit atomic_* (scalar or tensor)
    Lang->>Lang: choose scalar vs tile-region path
    alt Tile-region
        Lang->>TileOp: create tl.tileop.atomic*
        TileOp->>Infer: call InferLayout via ParallelOp pipeline
        Infer->>Lower: call LowerParallelLoop (provides layout_map)
    else Scalar
        Lang->>Codegen: emit tl.atomic_*_elem_op call
    end
    Lower->>Vect: pass lowered loop IR (may contain atomic_add_elem_op)
    Vect->>Vect: detect loop extent & decide vector width (dtype, SM)
    alt Vectorizable
        Vect->>TileOp: rewrite atomic_add -> atomic_addx2/x4
    else Non-vectorizable
        Vect->>TileOp: leave atomic_add_elem_op
    end
    TileOp->>Codegen: emit target intrinsics (handle memory_order)
    Codegen->>User: kernel source / binary

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • Rachmanino

Poem

🐇 I hop through loops and lowered lands,
I balance Max and Min with tiny hands.
A rewriter trims the tangled maze,
ParallelOps now guide the phase,
Vector hops — x2, x4 — and kernels dance.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 32.35%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and specifically describes the main changes: atomic reduction operations (max/min) and vectorization enhancements for atomic add.



📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1169199 and ec37e0d.

📒 Files selected for processing (1)
  • src/transform/loop_vectorize.cc




Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces atomic reduction operations (atomic_max, atomic_min) and enhances atomic operations with automatic vectorization support, along with significant code architecture improvements using the CRTP pattern.

Changes:

  • Added new atomic reduction operations (atomic_max, atomic_min) with both element-wise and tile-region-based modes
  • Implemented automatic vectorization of atomic_add_elem_op to vectorized variants (atomic_addx2_elem_op, atomic_addx4_elem_op)
  • Refactored atomic operation architecture using CRTP pattern with AtomicOpBaseNode<T> base class to reduce code duplication

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Summary per file:

  • tilelang/language/atomic.py: Added atomic_max and atomic_min Python APIs with scalar and tile-region support; migrated from call_extern to call_intrin
  • src/op/atomic_reduce.h: Introduced AtomicOpBaseNode CRTP base class and AtomicMax/AtomicMin node classes
  • src/op/atomic_reduce.cc: Implementation of atomic reduction operations with shared base functionality
  • src/op/atomic_add.h: Refactored AtomicAddNode to inherit from AtomicOpBaseNode
  • src/op/atomic_add.cc: Simplified atomic add implementation using base class methods
  • src/op/builtin.h: Added new atomic operation intrinsic declarations
  • src/op/builtin.cc: Registered new atomic operation intrinsics
  • src/transform/atomicadd_vectorize.h: Simplified header with new vectorization API
  • src/transform/atomicadd_vectorize.cc: Complete rewrite of atomic add vectorization using vectorized loop detection
  • src/transform/loop_vectorize.cc: Added vectorization planning support for atomic_add_elem_op
  • src/transform/loop_partition.cc: Integrated atomic add vectorization pass
  • src/target/codegen_cuda.cc: Added CUDA codegen for all new atomic intrinsics
  • src/target/utils.h: Added TargetHasSMVersionGE utility function
  • src/target/utils.cc: Implementation of SM version check utility
  • src/op/parallel.cc: Refactored to use IsSharedBuffer/IsGlobalBuffer utilities
  • testing/python/language/test_tilelang_language_atomic.py: Added comprehensive tests for tile-level atomic max/min operations
Comments suppressed due to low confidence (1)

testing/python/language/test_tilelang_language_atomic.py:29

  • The print statement was added for debugging purposes. In production test code, consider removing this or wrapping it in a debug flag to avoid cluttering test output.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/target/utils.cc (1)

63-63: Pre-existing nit: bitwise AND used where logical AND is meant.

Line 63 uses & (bitwise AND) instead of && (logical AND):

return arch >= 100 & arch <= 110;

Because the relational operators bind more tightly than &, this parses as (arch >= 100) & (arch <= 110) and happens to produce the intended result for these boolean operands, but && expresses the intent directly and short-circuits. This is outside the scope of the current PR changes.

Proposed fix
-  return arch >= 100 & arch <= 110;
+  return arch >= 100 && arch <= 110;
🤖 Fix all issues with AI agents
In `@src/op/atomic_add.h`:
- Around line 69-70: The declaration of LowerTMA is unused: either remove the
unused method declaration Stmt LowerTMA(const LowerArgs &T, arith::Analyzer
*analyzer) const from the class, or extract the TMA-specific lowering block
currently embedded in Lower() into a new method named LowerTMA and have Lower()
call LowerTMA; ensure the signature matches the declaration and move all
TMA-related variables and logic from Lower() (the code between the TMA handling
boundaries) into LowerTMA so there are no duplicate implementations or dead
declarations.

In `@src/op/atomic_reduce.cc`:
- Around line 159-162: The scalar fast-path incorrectly uses a single index {0}
for BufferLoad/BufferStore even when the buffers are multi-dimensional; update
the branch where is_scalar is true to build the full index list instead of
hardcoding {0} — e.g., call the existing MakeIndices helper (or construct a
vector of zeros of length dst->ndim using ranges[i]->min) to produce one index
per dimension and pass that to BufferLoad(src, indices) and BufferStore(dst,
value, indices), leaving the surrounding For(Var("i"), ...) and returns intact.

In `@src/transform/atomicadd_vectorize.cc`:
- Around line 139-146: The vectorized atomic call currently drops annotations
(e.g., memory_order) from the original node; preserve node->annotations when
constructing the vectorized Call returned by GetVectorizedAtomicOp(vector_size_)
so the memory ordering and other metadata are carried over (i.e., include
node->annotations on the Call that uses addr_dst and addr_src and
node->dtype/GetVectorizedAtomicOp).

In `@src/transform/loop_vectorize.cc`:
- Around line 151-182: The code incorrectly calls Downcast<PrimExpr>(node) with
a raw const CallNode*; replace those calls with ffi::GetRef<PrimExpr>(node) so
the raw pointer is converted to a proper ObjectRef/smart pointer before
returning; update both occurrences (the return in the atomic_add_elem_op branch
and the return in the "Other calls" branch) to use ffi::GetRef<PrimExpr>(node)
consistent with the pattern used elsewhere (e.g., ffi::GetRef<Stmt>(node)) and
leave the surrounding logic (vector_size_ updates and early returns) unchanged.

In `@tilelang/language/atomic.py`:
- Line 376: The return_type determination in atomic_addx4 uses a string-based
check ("float" in str(dst.dtype).lower()) which is inconsistent with
atomic_addx2; update atomic_addx4 to use the same dtype-based check as
atomic_addx2—inspect dst.dtype directly to decide float vs handle (i.e., mirror
the exact condition used in atomic_addx2) so return_type is set consistently for
float and non-float types.
- Around line 375-377: The atomic_addx4 implementation ignores the return_prev
flag because atomic_addx4_op is set to the same op in both branches; update the
selection in atomic_addx4 (variable atomic_addx4_op) to pick the return-value
variant when return_prev is true (e.g.
op.Op.get("tl.atomic_addx4_elem_op_return") or whatever naming is used for
return-value ops) and use the existing non-return op otherwise, and if a
return-value variant is not present, raise NotImplementedError to make the
limitation explicit.
- Around line 337-339: The code in atomic_addx2 and atomic_addx4 ignores the
return_prev flag because atomic_addx2_op/atomic_addx4_op selection always picks
the same op; update the selection to mirror atomic_add by choosing
"tl.atomic_addx2_ret_elem_op" / "tl.atomic_addx4_ret_elem_op" when return_prev
is True and the non-_ret variants otherwise, but since those _ret ops are not
declared you must either (A) add the corresponding builtin op
declarations/definitions (tl.atomic_addx2_ret_elem_op and
tl.atomic_addx4_ret_elem_op) to the backend builtins (src/op/builtin.h and
src/op/builtin.cc) or (B) follow the atomic_add tile-region pattern and raise
NotImplementedError when return_prev is True in atomic_addx2 and atomic_addx4;
adjust return_type and T.call_intrin usage accordingly and reference the symbols
atomic_addx2_op, atomic_addx4_op, atomic_addx2_ret_elem_op,
atomic_addx4_ret_elem_op, and the functions atomic_addx2 / atomic_addx4 when
making the change.
🧹 Nitpick comments (3)
tilelang/language/atomic.py (1)

60-68: Consider extracting duplicated get_extent helper.

The get_extent function is defined identically in atomic_max, atomic_min, and atomic_add. Consider extracting it to a module-level helper or importing from copy_op.py where a similar helper already exists (see tilelang/language/copy_op.py lines 63-76).

♻️ Suggested refactor
+def _get_extent(data):
+    """Return the inferred extent (shape) of a buffer-like object."""
+    if isinstance(data, Var) and T.has_let_value(data):
+        data = T.get_let_value(data)
+    if isinstance(data, Buffer):
+        return data.shape
+    elif isinstance(data, BufferRegion):
+        return [x.extent for x in data.region]
+    else:
+        return None
+
+
 def atomic_max(dst: Buffer, value: PrimExpr, memory_order: str | None = None, return_prev: bool = False) -> PrimExpr:
     # ... docstring ...
-
-    def get_extent(data):
-        if isinstance(data, Var) and T.has_let_value(data):
-            data = T.get_let_value(data)
-        if isinstance(data, Buffer):
-            return data.shape
-        elif isinstance(data, BufferRegion):
-            return [x.extent for x in data.region]
-        else:
-            return None
-
-    src_extent = get_extent(value)
-    dst_extent = get_extent(dst)
+    src_extent = _get_extent(value)
+    dst_extent = _get_extent(dst)

Apply the same pattern to atomic_min and atomic_add.

Also applies to: 147-155, 239-257

testing/python/language/test_tilelang_language_atomic.py (1)

29-29: Consider removing debug print statements before merge.

While useful during development, print(kernel.get_kernel_source()) statements can clutter CI logs. Consider removing them or gating behind a verbose flag if they're not needed for the final tests.

src/op/atomic_add.cc (1)

90-100: Consider unifying vectorization length logic with atomicadd_vectorize.cc.

This logic duplicates AtomicAddVectorizeRewriter::GetMaxVectorSize in src/transform/atomicadd_vectorize.cc (lines 64-73). Consider extracting to a shared utility to avoid divergence.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d8d367 and b911466.

📒 Files selected for processing (16)
  • src/op/atomic_add.cc
  • src/op/atomic_add.h
  • src/op/atomic_reduce.cc
  • src/op/atomic_reduce.h
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/op/parallel.cc
  • src/target/codegen_cuda.cc
  • src/target/utils.cc
  • src/target/utils.h
  • src/transform/atomicadd_vectorize.cc
  • src/transform/atomicadd_vectorize.h
  • src/transform/loop_partition.cc
  • src/transform/loop_vectorize.cc
  • testing/python/language/test_tilelang_language_atomic.py
  • tilelang/language/atomic.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.

Applied to files:

  • testing/python/language/test_tilelang_language_atomic.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_atomic.py
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_atomic.py
🧬 Code graph analysis (8)
src/target/utils.h (1)
src/target/utils.cc (2)
  • TargetHasSMVersionGE (130-135)
  • TargetHasSMVersionGE (130-130)
src/transform/loop_partition.cc (1)
src/transform/atomicadd_vectorize.cc (2)
  • VectorizeAtomicAdd (157-160)
  • VectorizeAtomicAdd (157-157)
src/target/utils.cc (2)
src/op/gemm.cc (2)
  • GetArchInt (413-424)
  • GetArchInt (413-413)
src/op/gemm_py.cc (2)
  • GetArchInt (230-241)
  • GetArchInt (230-230)
src/transform/atomicadd_vectorize.h (1)
src/transform/atomicadd_vectorize.cc (2)
  • VectorizeAtomicAdd (157-160)
  • VectorizeAtomicAdd (157-157)
src/op/atomic_add.cc (3)
src/op/atomic_reduce.cc (4)
  • GetElemOpStatic (52-52)
  • GetElemOpStatic (52-52)
  • GetElemOpStatic (81-81)
  • GetElemOpStatic (81-81)
src/transform/atomicadd_vectorize.cc (2)
  • dtype (64-73)
  • dtype (64-64)
src/target/utils.cc (2)
  • TargetHasSMVersionGE (130-135)
  • TargetHasSMVersionGE (130-130)
src/op/parallel.cc (1)
src/op/utils.h (1)
  • IsGlobalBuffer (55-57)
tilelang/language/atomic.py (5)
tilelang/language/copy_op.py (1)
  • get_extent (64-77)
tilelang/language/frame.py (2)
  • has_let_value (188-197)
  • get_let_value (200-209)
tilelang/language/utils.py (1)
  • region (7-10)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
tilelang/utils/language.py (2)
  • legalize_pairwise_extents (406-449)
  • to_buffer_region (194-237)
testing/python/language/test_tilelang_language_atomic.py (2)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-163)
tilelang/language/atomic.py (2)
  • atomic_max (20-104)
  • atomic_min (107-191)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc

[error] 281-281: syntax error

(syntaxError)

🪛 Ruff (0.14.11)
tilelang/language/atomic.py

337-337: Useless if-else condition

(RUF034)


375-375: Useless if-else condition

(RUF034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (35)
src/op/builtin.h (1)

588-658: LGTM! Well-organized atomic operation declarations.

The new atomic operation declarations follow a consistent naming convention and are well-documented. The separation between regular operations (atomic_*_elem_op) and return-value variants (atomic_*_ret_elem_op) is clean.

Note: The vectorized variants (atomic_addx2_elem_op, atomic_addx4_elem_op) don't have corresponding _ret_ versions. This is consistent with the code in atomic.py, but as noted in that file's review, the return_prev parameter in atomic_addx2/atomic_addx4 should either be implemented or raise NotImplementedError.

src/transform/atomicadd_vectorize.h (1)

19-29: LGTM! Clean API simplification.

The simplified interface is much cleaner than the previous planner-based approach. The function signature now relies on Target::Current() internally (as shown in the implementation at src/transform/atomicadd_vectorize.cc lines 156-159), which is appropriate for this use case.

src/op/atomic_reduce.h (2)

26-71: Well-designed CRTP base class for atomic operations.

The AtomicOpBaseNode<Derived> template provides a clean foundation for atomic operations with:

  • Common buffer/range management
  • Shared lowering and layout inference hooks
  • Memory order extraction from annotations
  • CRTP pattern for element-wise operation dispatch via GetElemOp()

The mutable qualifier on par_op_ is appropriate for caching during const operations.


77-139: LGTM! Consistent AtomicMax and AtomicMin node implementations.

Both AtomicMaxNode and AtomicMinNode follow the same structure with:

  • FFI reflection declarations
  • Static Get() and GetElemOpStatic() accessors returning atomic_max_elem_op() and atomic_min_elem_op() respectively
  • Clone() implementation
  • RegisterReflection() for property access
src/target/utils.h (1)

32-32: LGTM! Useful utility for SM version checks.

The TargetHasSMVersionGE function provides a clean way to check CUDA SM version compatibility. The implementation correctly handles non-CUDA targets by returning false.
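
A minimal sketch of the described behavior, assuming the existing GetArchInt helper and a CUDA-target predicate (the exact spellings below are assumptions; the real code lives in src/target/utils.cc):

// Hedged sketch only, not the actual implementation.
bool TargetHasSMVersionGE(Target target, int sm_version) {
  if (!TargetIsCuda(target))
    return false;                          // non-CUDA targets never pass the check
  return GetArchInt(target) >= sm_version; // e.g. sm_version = 90 for Hopper and newer
}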

src/transform/loop_partition.cc (2)

32-32: LGTM!

The new include for atomic add vectorization is appropriately placed with other transform headers.


300-302: LGTM!

The atomic add vectorization step is correctly positioned after general loop vectorization (Step 2) and before predicate wrapping (Step 4). This ordering ensures that vectorized atomic operations benefit from the loop vectorization pass while the predicate guard is still applied correctly to the final transformed loop.

src/target/utils.cc (1)

130-135: LGTM!

The TargetHasSMVersionGE function follows the established pattern for target capability checks in this file. The implementation correctly handles non-CUDA targets and delegates to GetArchInt for architecture comparison.

src/op/parallel.cc (2)

12-12: LGTM!

The include is appropriately added for arithmetic helper utilities.


457-461: LGTM!

The refactoring from string-based scope checks to helper predicates (IsSharedBuffer, IsGlobalBuffer) improves code maintainability and consistency with similar changes in loop_vectorize.cc. The logic remains functionally equivalent while being more readable.

Also applies to: 476-476

src/transform/loop_vectorize.cc (2)

27-27: LGTM!

The include for ../op/utils.h is needed for the IsSharedBuffer and IsGlobalBuffer helper predicates.


122-123: LGTM!

Consistent refactoring to use helper predicates instead of inline string comparisons for buffer scope checks.

Also applies to: 137-138

src/target/codegen_cuda.cc (5)

2891-2900: LGTM!

The atomic_add_elem_op codegen correctly emits AtomicAdd(dst, src[, memory_order]) with optional memory order parameter. The statement-context pattern (PrintIndent + stream) is appropriate.


2901-2909: LGTM!

The atomic_add_ret_elem_op correctly outputs to os (expression context) since it returns the previous value, matching the pattern for other return-value operations.


2910-2929: LGTM!

The vectorized atomic add variants (atomic_addx2_elem_op, atomic_addx4_elem_op) correctly emit AtomicAddx2/AtomicAddx4 calls with consistent handling of the optional memory order parameter.


2930-2941: LGTM!

The atomic_load_elem_op and atomic_store_elem_op handlers are correctly implemented:

  • atomic_load outputs to os (returns loaded value)
  • atomic_store uses PrintIndent + stream (statement context)
  • Both require the memory_order parameter as expected for explicit memory ordering operations.

2942-2979: LGTM!

The atomic_max_elem_op, atomic_max_ret_elem_op, atomic_min_elem_op, and atomic_min_ret_elem_op handlers follow the same consistent patterns as the atomic add operations:

  • Non-return variants use statement context (PrintIndent + stream)
  • Return variants use expression context (os)
  • Memory order is optional for all max/min operations
testing/python/language/test_tilelang_language_atomic.py (3)

7-12: LGTM! Pass configs correctly disable optimizations.

The explicit disabling of TMA lowering and warp specialization for the scalar atomic_add_program test ensures deterministic behavior and isolates the core atomic_add functionality being tested.


395-429: LGTM! Tile-level atomic max test correctly implemented.

The test properly initializes B with float("-inf") for max reduction, and the reference program correctly computes element-wise maximum across the K dimension. The kernel structure is consistent with the existing tile_atomic_add_program pattern.


432-466: LGTM! Tile-level atomic min test correctly implemented.

The test properly initializes B with float("inf") for min reduction, and the reference program correctly computes element-wise minimum across the K dimension. The implementation is symmetric with tile_atomic_max and follows established patterns.

src/transform/atomicadd_vectorize.cc (3)

22-34: LGTM! Buffer load extraction helper is well-implemented.

The helper correctly handles both direct BufferLoad expressions and those wrapped in address_of calls, returning an Optional for clean error handling.


39-48: LGTM! Vector size to op mapping is correct.

The mapping correctly handles x4 (SM>=90 float32), x2 (float16/bfloat16), and falls back to scalar for unsupported cases. The default case provides a safe fallback.


157-160: LGTM! Entry point handles undefined target gracefully.

Target::Current(false) returns an undefined target if no context is set, which is safe since TargetHasSMVersionGE returns false for non-CUDA targets (per src/target/utils.cc), disabling SM>=90 float32 vectorization in that case.

src/op/atomic_add.cc (3)

77-77: LGTM! Static element op accessor follows established pattern.

This is consistent with AtomicMaxNode::GetElemOpStatic() in src/op/atomic_reduce.cc (line 51 from relevant snippets).


561-583: LGTM! Non-TMA lowering pipeline is well-structured.

The pipeline correctly creates a SIMT loop, fuses parallel loops, runs multi-level layout inference, and lowers with proper predicate handling via ParallelOp. The three-level inference (kCommon, kStrict, kFree) ensures robust layout selection.


146-149: Scalar case intentionally uses non-atomic BufferStore for single-threaded execution.

The scalar case (when loop_vars is empty) returns a For loop with ForKind::kSerial and extent 1, which guarantees single-threaded execution. This is intentional design—the comment on line 141 explicitly states "For scalar cases a serial For of extent 1 is returned." When only one thread can execute the code (single-threaded loop), atomic semantics are unnecessary. This pattern is consistently used across atomic_add.cc, atomic_reduce.cc, and copy.cc for all scalar operations.

Likely an incorrect or invalid review comment.

src/op/builtin.cc (1)

354-402: LGTM! Atomic operation builtins are correctly defined.

The new atomic builtins follow consistent naming conventions (atomic_<op>_elem_op, atomic_<op>_ret_elem_op for return-previous variants) and correctly use kOpaque effect kind for operations with memory side effects. Input counts are appropriate (a construction sketch follows the list):

  • 3 inputs for add/max/min (dst, src, memory_order)
  • 2 inputs for load (src, memory_order)
  • 3 inputs for store (dst, value, memory_order)
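
For illustration, constructing one of these intrinsic calls from C++ looks roughly like this; the helper name, variable names, and the result-dtype choice are placeholders, not the PR's actual code:

// Hedged sketch: a tl.atomic_max_elem_op call with the three-argument arity
// noted above (destination pointer, source value, memory-order constant).
PrimExpr MakeAtomicMaxCall(PrimExpr dst_ptr, PrimExpr src_value, PrimExpr memory_order) {
  return tir::Call(src_value->dtype, tl::atomic_max_elem_op(),
                   {dst_ptr, src_value, memory_order});
}
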
src/op/atomic_add.h (1)

16-71: LGTM! AtomicAddNode class design is clean and well-documented.

The CRTP inheritance from AtomicOpBaseNode<AtomicAddNode> enables code sharing with AtomicMaxNode/AtomicMinNode. The public API is minimal, protected helpers are appropriately scoped, and the brief doc comments clarify each method's purpose.

src/op/atomic_reduce.cc (7)

29-81: LGTM!

The AtomicMax and AtomicMin implementations follow a consistent pattern with proper buffer region handling, clone support for parallel operations, and appropriate element-op accessors. The structural similarity between the two is acceptable given the CRTP base class handles the complex shared logic.


102-121: LGTM!

The index construction correctly handles unit-extent dimensions by using the range minimum, while properly incorporating iteration variables for non-trivial dimensions. The ICHECK at the end provides useful diagnostic information when there's a mismatch.


123-153: LGTM!

The predicate generation correctly handles bounds checking with symbolic analysis to eliminate provably-true conditions. The return of an empty PrimExpr for the no-predicate case follows TVM conventions.


164-214: LGTM!

The non-scalar path correctly:

  • Binds iteration variables for analysis
  • Constructs proper indices via MakeIndices
  • Handles dtype casting between src and dst
  • Uses CRTP to retrieve the memory order and element op from the derived type
  • Builds nested parallel loops with appropriate annotations

216-233: LGTM!

The layout inference correctly enforces that source and destination fragments must have matching layouts for atomic reduce operations, which is a necessary constraint for correctness. Returning an empty LayoutMap is appropriate since these operations don't introduce new layout requirements.


235-260: LGTM!

The lowering implementation follows the established pattern: SIMT loop generation → fusion → parallel op construction → multi-level layout inference → final lowering with predicates. The three-level layout inference (kCommon, kStrict, kFree) provides robust layout handling.


262-284: LGTM!

Template instantiations are properly declared for all three atomic operation types. The operator registrations correctly specify:

  • 2 inputs (src and dst regions)
  • kOpaque call effect kind, appropriate for operations with side effects

Regarding the static analysis hint at line 281: This is a false positive. TVM_FFI_STATIC_INIT_BLOCK() is a TVM macro that expands to a valid static initialization construct. The analyzer doesn't parse the macro expansion.


…ions

- Updated function signatures in atomic add, atomic reduce, copy, and fill operations to include layout map parameters for improved layout handling.
- Refactored vectorization logic to utilize layout maps, ensuring better performance and compatibility with various buffer layouts.
- Enhanced the LowerParallelLoop function to accept layout maps, facilitating more efficient loop transformations.
- Added checks for buffer contiguity in vectorization processes to ensure correctness when using layout maps.
- Updated tests to validate the new layout handling and vectorization behavior.
- Introduced a new method to identify atomic operations within the legalize_safe_memory_access transformation.
- Updated the VisitStmt_ function to handle both CallExtern and atomic operations, ensuring recursive condition collection for these cases.
- Enhanced comments for clarity on the handling of atomic operations in the context of memory access legality.
- Implemented support for various atomic operations including atomic add, atomic load, atomic store, atomic max, and atomic min in the HIP code generation.
- Enhanced the handling of atomic operations to include optional memory order parameters.
- Improved code readability with added comments explaining the purpose of each atomic operation.
- Ensured consistency with existing atomic operation implementations in the codebase.
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/op/atomic_reduce.cc`:
- Around line 30-43: The AtomicMax::AtomicMax constructor assumes args has at
least two elements and dereferences args[0]/args[1] without checks; add a
precondition check (e.g., DCHECK/ICHECK/CHECK_GE or throw a descriptive
std::runtime_error) verifying args.size() >= 2 at the start of
AtomicMax::AtomicMax, and return/raise an error with a clear message if the
condition fails; keep the rest of the logic (calling NormalizeToBufferRegion,
populating rgs/bf, and assigning node fields) unchanged.

In `@src/target/codegen_hip.cc`:
- Around line 997-1035: The codegen emits calls to missing HIP template helpers:
add templated functions AtomicLoad(src_ptr, memory_order), AtomicStore(dst_ptr,
value, memory_order), AtomicMaxRet(address, val, memory_order), and
AtomicMinRet(address, val, memory_order) into the HIP atomic template header
(matching the style and naming of existing
AtomicAdd/AtomicAddRet/AtomicMax/AtomicMin/AtomicAddx2* functions); ensure
AtomicLoad returns the loaded value, AtomicStore is void, and
AtomicMaxRet/AtomicMinRet return the previous value, accept the same
pointer/value types as AtomicMax/AtomicMin, accept an optional memory_order
parameter, and implement them using the same atomic primitive patterns (or
__atomic builtins) and template signatures as the existing Atomic* helpers so
generated code from atomic_load_elem_op, atomic_store_elem_op,
atomic_max_ret_elem_op, and atomic_min_ret_elem_op will compile.
♻️ Duplicate comments (1)
src/op/atomic_reduce.cc (1)

159-163: Scalar case still uses incorrect hardcoded indices.

This issue was flagged in a previous review: when loop_vars is empty but the buffer is multi-dimensional (e.g., shape (1, 1, 1)), using {0} provides only one index instead of the required three. The fix should use MakeIndices to construct proper indices for all dimensions.

🧹 Nitpick comments (2)
src/op/atomic_reduce.cc (1)

59-72: Same missing bounds check; consider extracting shared constructor logic.

This constructor duplicates lines 30-43 verbatim. Consider extracting a helper (e.g., InitAtomicOpNode) to reduce duplication and add the bounds check once.

♻️ Sketch of helper extraction
// Helper to initialize common atomic op node fields
template <typename NodeType>
void InitAtomicOpNode(ObjectPtr<NodeType>& node, const Array<PrimExpr>& args,
                      const Map<String, ObjectRef>& annotations) {
  ICHECK(args.size() >= 2) << "Atomic op requires at least 2 arguments";
  Array<Range> rgs[2];
  Buffer bf[2];
  for (int i = 0; i < 2; i++) {
    auto region = NormalizeToBufferRegion(args[i]);
    rgs[i] = region->region;
    bf[i] = region->buffer;
  }
  std::tie(node->src, node->dst) = std::tie(bf[0], bf[1]);
  std::tie(node->src_range, node->dst_range) = std::tie(rgs[0], rgs[1]);
  node->annotations = annotations;
}
src/target/codegen_hip.cc (1)

1036-1047: Consider adding argument count validation for load/store operations.

Unlike other atomic operations that check op->args.size() > 2 for optional memory_order, atomic_load_elem_op and atomic_store_elem_op directly access indices without validation. Adding ICHECK assertions would improve debuggability if malformed IR is ever passed.

Suggested defensive checks
   } else if (op->op.same_as(tl::atomic_load_elem_op())) {
     // atomic_load_elem_op(src_ptr, memory_order) -> returns loaded value
+    ICHECK_EQ(op->args.size(), 2U)
+        << "atomic_load_elem_op expects 2 arguments (src_ptr, memory_order)";
     os << "AtomicLoad(" << PrintExpr(op->args[0]) << ", "
        << PrintExpr(op->args[1]) << ")";
   } else if (op->op.same_as(tl::atomic_store_elem_op())) {
     // atomic_store_elem_op(dst_ptr, value, memory_order)
+    ICHECK_EQ(op->args.size(), 3U)
+        << "atomic_store_elem_op expects 3 arguments (dst_ptr, value, memory_order)";
     std::string dst_ptr = PrintExpr(op->args[0]);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1ef4c9 and 8eb2a29.

📒 Files selected for processing (2)
  • src/op/atomic_reduce.cc
  • src/target/codegen_hip.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-12T07:25:35.591Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1631
File: src/transform/thread_storage_sync.cc:1126-1137
Timestamp: 2026-01-12T07:25:35.591Z
Learning: In TileLang's thread storage synchronization pass (src/transform/thread_storage_sync.cc), at the IR level where PointerAccessIsDisjoint is called, the threads array in AccessEntry is guaranteed to contain all three thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z), making access to the last 3 elements via `threads[threads.size() + idx - 3]` safe.

Applied to files:

  • src/op/atomic_reduce.cc
🧬 Code graph analysis (1)
src/op/atomic_reduce.cc (1)
src/op/atomic_reduce.h (1)
  • AtomicMax (99-107)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc

[error] 282-282: syntax error

(syntaxError)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (8)
src/op/atomic_reduce.cc (7)

88-101: LGTM - common TVM pattern for loop variable naming.

The char('i' + idx) naming scheme is a standard TVM convention. While it technically overflows for >17 non-unit dimensions, this is practically unreachable for atomic operations.


103-154: LGTM - index and predicate construction is sound.

The iteration variable tracking with idx and the ICHECK assertion ensure correct index construction. The predicate optimization using analyzer->CanProve is a good practice.


178-214: LGTM - non-scalar SIMT loop generation is well-structured.

The atomic call construction with proper dtype casting, pointer generation via address_of, and memory ordering via CRTP GetMemoryOrder() follows the expected pattern. The nested parallel loop construction with annotation propagation is correct.


217-234: InferLayout returns empty map - verify this is intentional.

The method validates that src and dst have matching layouts when both are fragments, but always returns an empty LayoutMap. This means the operator doesn't contribute any layout constraints back to the inference system. If this is intentional (relying on other operators to establish layouts), this is fine; otherwise, it may cause layout mismatches.


236-261: LGTM - lowering follows the standard ParallelOp pattern.

The three-level layout inference (kCommon → kStrict → kFree) and subsequent LowerParallelLoop integration is consistent with other tile operators in the codebase.


272-285: LGTM - operator registration and reflection initialization.

The tile operator registration with kOpaque call effect kind is appropriate for atomic operations. The static analysis "syntax error" at line 282 is a false positive—TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for static initialization.


263-266: No ODR violation exists. The explicit instantiation of AtomicOpBaseNode<AtomicAddNode> appears only in atomic_reduce.cc:266. The atomic_add.cc file does not contain a duplicate instantiation—it only provides method implementations, operator registration, and FFI initialization. This is the correct pattern and poses no ODR risk.

Likely an incorrect or invalid review comment.

src/target/codegen_hip.cc (1)

1048-1085: LGTM!

The atomic max/min implementations follow the same established pattern as the atomic add variants. The optional memory_order handling is consistent, and the statement vs expression form handling is correct.


- Changed the GetElemOpStatic method to a virtual GetElemOp method in AtomicOpBaseNode, enhancing polymorphism.
- Updated AtomicAddNode, AtomicMaxNode, and AtomicMinNode to override the new GetElemOp method.
- Removed unnecessary template parameters from AtomicOpBaseNode, simplifying the class structure.
- Cleaned up includes in atomic_reduce.cc to remove unused dependencies, improving code organization.
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/op/atomic_add.cc (2)

146-148: Scalar case uses hardcoded single-element index array.

When loop_vars is empty (all dimensions have extent 1), the buffer may still be multi-dimensional (e.g., shape (1, 1, 1)). Using {0} provides only one index. Consider using MakeIndices to generate proper indices for all dimensions.

🐛 Proposed fix
   if (is_scalar) {
+    Array<PrimExpr> src_indices = MakeIndices(loop_vars, 0);
+    Array<PrimExpr> dst_indices = MakeIndices(loop_vars, 1);
     return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
+               BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
   }

169-171: Predicates computed but never used.

src_predicate and dst_predicate are computed but not used in the atomic call or loop body. If bounds checking is needed, wrap the atomic call in an IfThenElse. Otherwise, remove these unused computations.

Option 1: Remove unused predicates
-  // Optional bounds predicates for src and dst
-  PrimExpr src_predicate = MakePredicate(analyzer, loop_vars, src->shape, 0);
-  PrimExpr dst_predicate = MakePredicate(analyzer, loop_vars, dst->shape, 1);
-
   // Load source value and cast to dst dtype if needed
Option 2: Use predicates for bounds checking
   Stmt body = tvm::tir::Evaluate(atomicadd_call);
+  
+  // Guard with bounds predicate if needed
+  PrimExpr combined_predicate;
+  if (src_predicate.defined() && dst_predicate.defined()) {
+    combined_predicate = And(src_predicate, dst_predicate);
+  } else if (src_predicate.defined()) {
+    combined_predicate = src_predicate;
+  } else if (dst_predicate.defined()) {
+    combined_predicate = dst_predicate;
+  }
+  if (combined_predicate.defined()) {
+    body = IfThenElse(combined_predicate, body);
+  }
🤖 Fix all issues with AI agents
In `@src/op/atomic_reduce.cc`:
- Around line 58-71: The AtomicMin constructor accesses args[0] and args[1]
without checking args.size(); add a bounds check at the start of
AtomicMin::AtomicMin (e.g., if (args.size() < 2) throw/assert with a clear
message) before calling NormalizeToBufferRegion, so you validate the input and
avoid out-of-bounds access when constructing the AtomicMinNode and assigning
node->src/node->dst and ranges.
♻️ Duplicate comments (4)
src/op/atomic_add.h (1)

69-70: Unused method declaration LowerTMA.

LowerTMA is declared but never defined or called. The TMA lowering logic in atomic_add.cc is directly embedded in Lower() (lines 348-560). Either remove this declaration or extract the TMA logic into this method for consistency.

Proposed fix (if not extracting TMA logic)
   /// Compute linear layout for shared tensor (used in TMA atomic add)
   Layout ComputeLinearLayout(const Buffer &shared_tensor) const;
-
-  /// Lower TMA-based atomic add
-  Stmt LowerTMA(const LowerArgs &T, arith::Analyzer *analyzer) const;
 };
src/op/atomic_reduce.h (1)

43-50: Apply suggested fix for template keyword in as<> call.

Per the past review comment, this should use val->template as<IntImmNode>() to fix the build error on some compilers.

🐛 Proposed fix
   int GetMemoryOrder() const {
     if (auto val = annotations.Get("memory_order")) {
-      if (auto int_val = val->as<IntImmNode>()) {
+      if (auto int_val = val->template as<IntImmNode>()) {
         return int_val->value;
       }
     }
     return 0;
   }
src/op/atomic_reduce.cc (2)

29-42: Missing bounds check on args array access.

The constructor accesses args[0] and args[1] without verifying that args.size() >= 2. This will cause undefined behavior if called with fewer arguments.

🐛 Proposed fix
 AtomicMax::AtomicMax(Array<PrimExpr> args, Map<String, ObjectRef> annotations) {
+  ICHECK(args.size() >= 2) << "AtomicMax requires at least 2 arguments, got " << args.size();
   ObjectPtr<AtomicMaxNode> node = tvm::ffi::make_object<AtomicMaxNode>();

154-157: Incorrect index handling for scalar case with multi-dimensional buffers.

When all dimensions have extent 1, loop_vars is empty but the buffer may still be multi-dimensional (e.g., shape (1, 1, 1)). Using hardcoded {0} provides only one index when multiple may be required.

The MakeIndices function correctly handles this by using ranges[i]->min for unit-extent dimensions, but it's bypassed here.

🐛 Proposed fix
   if (is_scalar) {
+    // Use MakeIndices to construct proper indices for all dimensions
+    Array<PrimExpr> src_indices = MakeIndices(loop_vars, 0);
+    Array<PrimExpr> dst_indices = MakeIndices(loop_vars, 1);
     return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
+               BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
   }
🧹 Nitpick comments (5)
src/op/atomic_reduce.h (2)

15-15: Avoid using namespace in header files.

using namespace tir; in a header pollutes the namespace for all translation units that include this file, potentially causing name collisions and making code harder to maintain.

Consider removing this directive and qualifying names explicitly (e.g., tir::Buffer, tir::Range), or confining it to a .cc file.


34-34: Consider making par_op_ protected or private.

par_op_ is a mutable member used for internal state during lowering. Exposing it publicly allows external code to modify it unexpectedly. Consider moving it to protected to match the other internal helpers.

src/op/atomic_add.h (1)

56-57: Document vectorization length assumptions.

Consider adding a brief doc comment explaining the returned values (2 for fp16/bf16, 4 for fp32 on SM>=90, 1 otherwise) directly in the header for quick reference, rather than only in the implementation.
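
Something along these lines in the header would do (wording is a suggestion, mirroring the values documented in atomic_add.cc):

  /// Vector width used when lowering this atomic add on `target`:
  /// 2 for float16/bfloat16, 4 for float32 on SM >= 90, and 1 otherwise.
  int GetVectorizeLength(Target target) const;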

src/op/atomic_reduce.cc (2)

87-99: Variable naming may overflow for high-dimensional tensors.

char('i' + idx) will produce non-alphabetic characters for tensors with more than 17 dimensions (after 'z'). While unlikely in practice, consider using a more robust naming scheme.

♻️ Proposed improvement
-    Var var = Var(std::string{char('i' + idx)}, src_range[i]->extent->dtype);
+    Var var = Var("i" + std::to_string(idx), src_range[i]->extent->dtype);

211-227: Consider propagating fragment layout when only one buffer has a layout inferred.

Currently, InferLayout only validates when both src and dst have layouts already present in the map. For consistency with how other fragment-based operators like ReduceOp and CumSumOp handle layout inference, consider propagating the layout from one buffer to the other when only one has a layout defined. This would make layout inference more robust when upstream operations only partially populate the layout map.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8eb2a29 and 70d4713.

📒 Files selected for processing (4)
  • src/op/atomic_add.cc
  • src/op/atomic_add.h
  • src/op/atomic_reduce.cc
  • src/op/atomic_reduce.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-12T07:25:35.591Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1631
File: src/transform/thread_storage_sync.cc:1126-1137
Timestamp: 2026-01-12T07:25:35.591Z
Learning: In TileLang's thread storage synchronization pass (src/transform/thread_storage_sync.cc), at the IR level where PointerAccessIsDisjoint is called, the threads array in AccessEntry is guaranteed to contain all three thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z), making access to the last 3 elements via `threads[threads.size() + idx - 3]` safe.

Applied to files:

  • src/op/atomic_reduce.cc
🧬 Code graph analysis (3)
src/op/atomic_reduce.h (4)
src/op/atomic_add.h (2)
  • tl (12-84)
  • GetElemOp (33-44)
src/op/builtin.h (1)
  • tl (22-717)
src/op/operator.h (2)
  • TileOperatorNode (75-85)
  • TileOperator (87-91)
src/transform/common/loop_vectorization_utils.h (1)
  • if (443-445)
src/op/atomic_add.h (2)
src/op/atomic_reduce.h (4)
  • tl (13-135)
  • AtomicOpBaseNode (24-134)
  • GetElemOp (77-88)
  • GetElemOp (109-120)
src/op/atomic_add.cc (12)
  • Lower (346-583)
  • Lower (346-346)
  • InferLayout (253-309)
  • InferLayout (253-254)
  • GetElemOp (77-77)
  • GetElemOp (77-77)
  • GetVectorizeLength (90-100)
  • GetVectorizeLength (90-90)
  • MakeSIMTLoop (143-207)
  • MakeSIMTLoop (143-143)
  • ComputeLinearLayout (219-234)
  • ComputeLinearLayout (219-219)
src/op/atomic_reduce.cc (2)
src/op/atomic_reduce.h (5)
  • AtomicMax (92-100)
  • GetElemOp (77-88)
  • GetElemOp (109-120)
  • AtomicMin (124-132)
  • GetMemoryOrder (43-89)
src/op/atomic_add.cc (6)
  • Clone (69-75)
  • Clone (69-69)
  • GetElemOp (77-77)
  • GetElemOp (77-77)
  • MakeSIMTLoop (143-207)
  • MakeSIMTLoop (143-143)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc

[error] 269-269: syntax error

(syntaxError)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (8)
src/op/atomic_reduce.h (2)

91-100: LGTM on wrapper class design.

The AtomicMax and AtomicMin wrapper classes follow TVM's standard pattern with TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE, constructors accepting args and annotations, and static Get() accessors. This is consistent with AtomicAdd.

Also applies to: 123-132


71-74: No action needed.

The third argument to TVM_FFI_DECLARE_OBJECT_INFO_FINAL is correct. All TileOperatorNode-derived classes throughout the codebase—including AtomicAddNode, CopyNode, GemmNode, and others—consistently use TileOperatorNode as the third argument regardless of whether they have an intermediate base class. This is the correct FFI type hierarchy registration for TVM's object system, where the macro declares the root FFI type rather than the direct parent class.

Likely an incorrect or invalid review comment.

src/op/atomic_add.h (1)

9-9: LGTM on inheritance refactor.

The change to inherit from AtomicOpBaseNode properly consolidates common atomic operation functionality. The class documentation clearly describes the TMA support and vectorization extensions specific to AtomicAdd.

Also applies to: 16-21

src/op/atomic_add.cc (3)

373-374: global_coords may be unused in some code paths.

global_coords is computed unconditionally, but it appears to be used only inside the loop when (*inner_box_dim) != instruction_dim (line 532) and in the else branch (line 549). Verify that the unconditional computation is intentional.


561-582: LGTM on ParallelOp-based lowering.

The non-TMA lowering path correctly:

  1. Builds a SIMT loop via MakeSIMTLoop
  2. Fuses parallel loops with ParallelLoopFuser::Fuse
  3. Runs layout inference at multiple levels (kCommon, kStrict, kFree)
  4. Lowers with LowerParallelLoop including predicate handling

This aligns with the broader PR's architectural shift to ParallelOp-based lowering.


90-100: LGTM on vectorization length logic.

The implementation correctly returns:

  • 2 for float16/bfloat16 (atomicAdd.x2)
  • 4 for float32 on SM >= 90 (atomicAdd.x4)
  • 1 for all other cases

This aligns with CUDA's vectorized atomic add capabilities.

src/op/atomic_reduce.cc (2)

229-253: LGTM on base class Lower implementation.

The AtomicOpBaseNode::Lower correctly implements the ParallelOp-based lowering pipeline:

  1. Creates SIMT loop via MakeSIMTLoop
  2. Fuses parallel loops
  3. Runs layout inference at all levels
  4. Lowers with LowerParallelLoop

This provides a clean default implementation that AtomicMax and AtomicMin inherit directly.


259-272: LGTM on operator registration.

The TIR_REGISTER_TL_TILE_OP registrations correctly set up atomicmax and atomicmin with 2 inputs and opaque call effect. The reflection registration in TVM_FFI_STATIC_INIT_BLOCK follows the established pattern.

Note: The static analysis "syntax error" at line 269 appears to be a false positive - the TVM_FFI_STATIC_INIT_BLOCK() macro is a valid TVM construct.


- Added checks in AtomicAdd, AtomicMax, and AtomicMin constructors to ensure at least two arguments are provided, improving error handling and user feedback.
- Removed the unused LowerTMA method declaration from the AtomicAdd class, streamlining the codebase.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/op/copy.cc`:
- Around lines 727-728: The return statement uses an undefined variable `transformed_loop`. Replace it with the fused loop variable `fused_loop`, so the call becomes VectorizeLoop(fused_loop, T.layout_map) and the resulting `vectorized_thread_loop` is returned from that code path.
♻️ Duplicate comments (1)
src/op/atomic_reduce.cc (1)

157-163: Scalar case still uses incorrect hardcoded indices for multi-dimensional buffers.

This issue was previously flagged. When all dimensions have unit extent, loop_vars is empty but the buffer may still require multiple indices (e.g., shape (1, 1, 1) needs 3 indices). The existing MakeIndices method correctly handles this by using ranges[i]->min for each dimension.

🐛 Proposed fix
 For AtomicOpBaseNode::MakeSIMTLoop(arith::Analyzer *analyzer) const {
   Array<IterVar> loop_vars = MakeIterVars();
   bool is_scalar = loop_vars.empty();
   if (is_scalar) {
+    Array<PrimExpr> src_indices = MakeIndices(loop_vars, 0);
+    Array<PrimExpr> dst_indices = MakeIndices(loop_vars, 1);
     return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
+               BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
   }

Note: The same issue exists in AtomicAddNode::MakeSIMTLoop (lines 145-150 in atomic_add.cc).

🧹 Nitpick comments (1)
src/op/atomic_add.cc (1)

93-103: Extract shared vectorization length logic into a utility function.

The GetVectorizeLength method in AtomicAddNode duplicates the logic from GetMaxVectorSize in src/transform/atomicadd_vectorize.cc (lines 64-73). Both implement identical vectorization length determination (float16/bfloat16 → 2, float32 with SM≥90 → 4, else → 1). Extract this into a shared utility to follow the DRY principle.
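One possible shape for the shared helper (the name and location are suggestions; TargetHasSMVersionGE is the existing utility from src/target/utils.cc):

// Single source of truth for the atomic-add vector width, usable by both
// AtomicAddNode::GetVectorizeLength and the vectorize pass.
inline int AtomicAddVectorizeLength(DataType dtype, const Target &target) {
  if (dtype.is_float16() || dtype.is_bfloat16()) return 2;  // atomicAdd.x2
  if (dtype.is_float() && dtype.bits() == 32 && TargetHasSMVersionGE(target, 90))
    return 4;                                               // atomicAdd.x4 on SM >= 90
  return 1;                                                 // scalar fallback
}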

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70d4713 and 511ddef.

📒 Files selected for processing (6)
  • src/op/atomic_add.cc
  • src/op/atomic_add.h
  • src/op/atomic_reduce.cc
  • src/op/copy.cc
  • src/op/fill.cc
  • src/transform/loop_vectorize.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-12T07:25:35.591Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1631
File: src/transform/thread_storage_sync.cc:1126-1137
Timestamp: 2026-01-12T07:25:35.591Z
Learning: In TileLang's thread storage synchronization pass (src/transform/thread_storage_sync.cc), at the IR level where PointerAccessIsDisjoint is called, the threads array in AccessEntry is guaranteed to contain all three thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z), making access to the last 3 elements via `threads[threads.size() + idx - 3]` safe.

Applied to files:

  • src/op/atomic_reduce.cc
🧬 Code graph analysis (6)
src/op/copy.cc (2)
src/transform/loop_vectorize.cc (4)
  • VectorizeLoop (489-500)
  • VectorizeLoop (489-490)
  • VectorizeLoop (502-512)
  • VectorizeLoop (502-503)
tilelang/transform/__init__.py (1)
  • VectorizeLoop (334-342)
src/op/atomic_add.cc (5)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
src/transform/atomicadd_vectorize.cc (2)
  • dtype (64-73)
  • dtype (64-64)
src/target/utils.cc (2)
  • TargetHasSMVersionGE (130-135)
  • TargetHasSMVersionGE (130-130)
tilelang/ir.py (1)
  • ParallelOp (60-60)
src/transform/loop_partition.cc (2)
  • LowerParallelLoop (274-310)
  • LowerParallelLoop (274-277)
src/transform/loop_vectorize.cc (2)
src/op/utils.h (1)
  • IsGlobalBuffer (55-57)
src/target/utils.cc (2)
  • TargetHasSMVersionGE (130-135)
  • TargetHasSMVersionGE (130-130)
src/op/atomic_reduce.cc (2)
src/op/atomic_add.cc (6)
  • Clone (72-78)
  • Clone (72-72)
  • GetElemOp (80-80)
  • GetElemOp (80-80)
  • MakeSIMTLoop (146-210)
  • MakeSIMTLoop (146-146)
src/op/atomic_add.h (1)
  • GetElemOp (33-44)
src/op/atomic_add.h (2)
src/op/atomic_reduce.h (4)
  • tl (13-135)
  • AtomicOpBaseNode (24-134)
  • GetElemOp (77-88)
  • GetElemOp (109-120)
src/op/atomic_add.cc (10)
  • Lower (349-586)
  • Lower (349-349)
  • InferLayout (256-312)
  • InferLayout (256-257)
  • GetElemOp (80-80)
  • GetElemOp (80-80)
  • GetVectorizeLength (93-103)
  • GetVectorizeLength (93-93)
  • MakeSIMTLoop (146-210)
  • MakeSIMTLoop (146-146)
src/op/fill.cc (1)
src/transform/loop_vectorize.cc (4)
  • VectorizeLoop (489-500)
  • VectorizeLoop (489-490)
  • VectorizeLoop (502-512)
  • VectorizeLoop (502-503)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc

[error] 275-275: syntax error

(syntaxError)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (19)
src/op/copy.cc (1)

745-748: LGTM: LowerParallelLoop call updated correctly.

The call now properly passes T.layout_map and the predicate from par_op->GetPredicate(T.thread_var), aligning with the updated function signature that threads layout information through the lowering pipeline.

src/transform/loop_vectorize.cc (4)

43-71: LGTM: Well-implemented contiguity check.

The IsBufferContiguous function correctly validates row-major layout by checking that the strides match the expected pattern (rightmost stride = 1, and each stride equals the next stride times the next dimension's extent). The empty strides case correctly returns true for implicitly contiguous buffers.
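The rule being checked boils down to the following predicate (a generic sketch of the stride condition, assuming strides.size() == shape.size(); not the function itself):

// Row-major contiguity: strides[n-1] == 1 and strides[i] == strides[i+1] * shape[i+1].
bool RowMajorContiguous(const Array<PrimExpr> &shape,
                        const Array<PrimExpr> &strides,
                        arith::Analyzer *analyzer) {
  if (strides.empty()) return true;  // implicit strides imply contiguous storage
  size_t n = shape.size();
  if (!analyzer->CanProveEqual(strides[n - 1], 1)) return false;
  for (size_t i = 0; i + 1 < n; ++i) {
    if (!analyzer->CanProveEqual(strides[i], strides[i + 1] * shape[i + 1]))
      return false;
  }
  return true;
}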


182-218: LGTM: Atomic add vectorization logic is sound.

The handling for atomic_add_elem_op correctly:

  • Validates at least 2 arguments
  • Extracts dtype from the address_of(BufferLoad) pattern
  • Applies appropriate vectorize lengths (2 for fp16/bf16, 4 for fp32 on SM≥90, 1 otherwise)
  • Avoids traversing pointer-type arguments that cannot be vectorized

The fix from the past review (using ffi::GetRef<PrimExpr>(node) at line 215) has been correctly applied.


234-265: LGTM: Layout-aware index transformation.

The logic correctly handles layout transformation when a layout_map entry exists:

  1. Forwards indices through the layout
  2. Computes linear offset when dimensions don't match
  3. Decomposes back to buffer shape dimensions

The contiguity check at line 236 ensures this transformation is only applied to row-major buffers.
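The flatten-and-decompose step (items 2 and 3) is standard index arithmetic; a generic helper for illustration, not the TileLang code itself:

// Decompose a linear offset back into per-dimension indices of a row-major shape.
Array<PrimExpr> Delinearize(PrimExpr linear, const Array<PrimExpr> &shape,
                            arith::Analyzer *analyzer) {
  std::vector<PrimExpr> rev;
  for (int i = static_cast<int>(shape.size()) - 1; i >= 0; --i) {
    rev.push_back(analyzer->Simplify(floormod(linear, shape[i])));
    linear = floordiv(linear, shape[i]);
  }
  return Array<PrimExpr>(rev.rbegin(), rev.rend());  // restore outer-to-inner order
}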


382-390: LGTM: API functions updated consistently.

The GetVectorizeSize and VectorizeLoop overloads are consistently updated to accept and propagate layout_map through the VectorizePlanner. This maintains API consistency across the vectorization pipeline.

Also applies to: 489-512

src/op/fill.cc (1)

170-171: LGTM: Consistent VectorizeLoop API updates.

All three VectorizeLoop calls are correctly updated to pass T.layout_map as the second argument, matching the new function signature. The changes are consistent across all buffer scope paths (fragment, local, shared/global).

Also applies to: 179-180, 194-195

src/op/atomic_add.cc (2)

45-47: LGTM: Good input validation.

The added argument count check ensures AtomicAdd receives at least 2 arguments (src and dst), providing a clear error message if violated. This is good defensive coding.


564-585: LGTM: Clean lowering pipeline implementation.

The non-TMA lowering path correctly:

  1. Creates a SIMT loop and fuses parallel loops
  2. Wraps in ParallelOp for layout inference
  3. Runs inference at multiple levels (Common, Strict, Free)
  4. Uses LowerParallelLoop with the new layout_map parameter

This aligns with the broader PR goal of threading layout information through the lowering pipeline.

src/op/atomic_add.h (3)

16-21: LGTM: Clean inheritance from AtomicOpBaseNode.

The class correctly inherits from AtomicOpBaseNode, gaining shared functionality for atomic operations (MakeIterVars, MakeIndices, MakePredicate, GetMemoryOrder) while adding TMA-specific extensions.


26-33: LGTM: Override declarations are correct.

The Lower, InferLayout, and GetElemOp overrides are properly declared to add TMA support and provide the element operation for atomic add. The previous issue about an unused LowerTMA declaration has been addressed.


36-44: Reflection registration pattern is consistent with codebase conventions.

The inherited members (src, dst, src_range, dst_range, annotations) are explicitly re-registered in the reflection definition, which matches the established pattern used in AtomicMaxNode, AtomicMinNode, and other operator classes throughout the codebase. This is the standard way TVM's reflection system handles inherited members—no verification concerns.

src/op/atomic_reduce.cc (8)

29-55: LGTM! Bounds check properly added.

The ICHECK for argument count addresses the previous review feedback. The implementation follows the same pattern as AtomicAddNode in atomic_add.cc.


61-87: LGTM! Implementation is consistent with AtomicMax.

The bounds check is properly added, and the implementation mirrors AtomicMax structure correctly.


93-105: LGTM!

The iteration variable generation correctly skips unit-extent dimensions. The variable naming scheme is adequate for typical use cases.


107-155: LGTM!

MakeIndices correctly handles multi-dimensional buffers with unit-extent dimensions by using ranges[i]->min. MakePredicate appropriately uses symbolic analysis to eliminate provably-true conditions.
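A condensed sketch of the indexing rule (parameter names are assumptions based on the discussion above):

// A collapsed (unit-extent) dimension contributes its range min; a looped
// dimension contributes min plus the next iteration variable.
Array<PrimExpr> MakeIndicesSketch(const Array<Range> &ranges,
                                  const Array<IterVar> &loop_vars) {
  Array<PrimExpr> indices;
  size_t var_idx = 0;
  for (size_t i = 0; i < ranges.size(); ++i) {
    if (is_one(ranges[i]->extent)) {
      indices.push_back(ranges[i]->min);
    } else {
      indices.push_back(ranges[i]->min + loop_vars[var_idx++]->var);
    }
  }
  return indices;
}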


178-215: LGTM!

The non-scalar path correctly generates SIMT loops with proper indices, atomic call construction, and parallel loop nesting. The polymorphic GetElemOp() call ensures the appropriate element operation is used.


217-233: LGTM!

Layout inference correctly validates that source and destination fragment buffers have matching layouts, preventing subtle data corruption bugs.
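For reference, the validation amounts to a structural-equality check along these lines (the exact condition and message in the source may differ):

ICHECK(StructuralEqual()(T.layout_map[src], T.layout_map[dst]))
    << "atomic reduce expects src and dst fragments to share the same layout";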


235-259: LGTM!

The lowering pipeline correctly creates a SIMT loop, fuses it, performs multi-level layout inference via ParallelOp, and lowers to the final representation.


265-278: LGTM! Registration and reflection setup is correct.

The operator registration with kOpaque call effect is appropriate for atomic operations. The static analysis syntax error on line 275 is a false positive—TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for static initialization.


@LeiWang1999
Member Author

@regression-perf

@github-actions

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21032540310

Results

File Original Latency Current Latency Speedup
example_tilelang_gemm_fp8_intrinsic 0.466653 2.53025 0.184429
example_gemm_intrinsics 0.035105 0.115329 0.30439
example_warp_specialize_gemm_copy_0_gemm_1 0.037185 0.038785 0.958747
example_warp_specialize_gemm_barrierpipe_stage2 0.037985 0.039425 0.963475
example_mha_bwd_bshd_wgmma_pipelined 0.0256159 0.0262036 0.977569
example_linear_attn_bwd 0.152292 0.154699 0.984441
example_gqa_decode 0.048481 0.04912 0.986991
example_linear_attn_fwd 0.0366084 0.0370083 0.989194
example_tilelang_gemm_splitk 1.40769 1.42214 0.989841
example_tilelang_gemm_splitk_vectorize_atomicadd 1.40899 1.42322 0.990004
example_warp_specialize_gemm_softpipe_stage2 0.038433 0.038753 0.991743
example_mha_inference 0.080514 0.081057 0.993301
example_mha_fwd_varlen 0.0453498 0.0455049 0.996593
example_mha_bwd_bshd 0.0408328 0.0409236 0.997781
example_tilelang_gemm_fp8 0.322312 0.32279 0.998518
example_gemm_autotune 0.022336 0.022369 0.998525
example_gqa_bwd 0.0498593 0.0498887 0.999411
example_gemm_schedule 0.032574 0.0325874 0.99959
example_gemv 0.288668 0.288783 0.999602
example_elementwise_add 0.297587 0.2977 0.999623
example_mha_bwd_bhsd 0.0400384 0.0400497 0.999716
block_sparse_attn_tilelang 0.0102953 0.0102968 0.999855
example_fusedmoe_tilelang 0.141385 0.14139 0.999959
tilelang_example_sparse_tensorcore 0.015094 0.015091 1.00019
example_dynamic 0.657198 0.656909 1.00044
example_tilelang_gemm_fp8_2xAcc 0.187524 0.187364 1.00085
example_convolution_autotune 1.00124 0.993948 1.00734
example_dequant_gemv_fp16xint4 0.0286961 0.0284301 1.00936
example_per_token_cast_to_fp8 0.00747716 0.00738219 1.01286
example_tilelang_nsa_decode 0.00685351 0.00675935 1.01393
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.52436 3.47262 1.0149
example_tilelang_nsa_fwd 0.00712136 0.00701134 1.01569
example_topk 0.011104 0.010913 1.0175
example_gemm 0.022977 0.022464 1.02284
sparse_mla_bwd 0.395898 0.383973 1.03106
example_dequant_gemm_w4a8 5.58144 5.39692 1.03419
example_warp_specialize_gemm_copy_1_gemm_0 0.039426 0.038113 1.03445
topk_selector 0.0560102 0.0540018 1.03719
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0160209 0.0154166 1.03919
fp8_lighting_indexer 0.037402 0.0359562 1.04021
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.016094 0.0154692 1.04039
example_mha_sink_fwd_bhsd_sliding_window 0.016314 0.015668 1.04123
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0150518 0.0144549 1.04129
example_blocksparse_gemm 0.0236474 0.0227097 1.04129
example_mha_sink_bwd_bhsd 0.0651034 0.0624531 1.04244
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0152362 0.014605 1.04322
sparse_mla_fwd 0.13639 0.130737 1.04324
example_mha_sink_bwd_bhsd_sliding_window 0.0464689 0.0445286 1.04358
example_gqa_sink_bwd_bhsd_sliding_window 0.0268353 0.0256382 1.04669
example_dequant_gemm_bf16_fp4_hopper 0.601869 0.574508 1.04763
example_dequant_gemm_fp4_hopper 1.12009 1.0686 1.04818
example_group_per_split_token_cast_to_fp8 0.010807 0.0102936 1.04988
example_gqa_sink_bwd_bhsd 0.0438115 0.0416913 1.05086
example_dequant_gemm_bf16_mxfp4_hopper 0.54382 0.517003 1.05187
example_mla_decode 0.487147 0.461322 1.05598
sparse_mla_fwd_pipelined 0.100132 0.0947812 1.05645
example_convolution 1.41357 1.33203 1.06121
example_mha_sink_fwd_bhsd 0.0165399 0.0155737 1.06204
example_gqa_bwd_wgmma_pipelined 0.074261 0.0697841 1.06415
example_gqa_bwd_tma_reduce_varlen 0.0637587 0.0523528 1.21787

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999
Member Author

@regression-perf

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/transform/loop_vectorize.cc`:
- Around lines 182-207: The comment in the atomic_add_elem_op branch says the args are not visited, but the code calls arith::IRMutatorWithAnalyzer::VisitExpr_(node), which does visit them. Either return the original call expression immediately in that branch (so pointer args are not traversed), or keep the current behavior and update the comment to say the args are intentionally visited. Apply one of these fixes to the block that examines address_of, BufferLoadNode, and dtype and sets vector_size_, so the code and comment agree.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 351003a and 1169199.

📒 Files selected for processing (1)
  • src/transform/loop_vectorize.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/loop_vectorize.cc (2)
src/op/utils.h (1)
  • IsGlobalBuffer (55-57)
src/target/utils.cc (2)
  • TargetHasSMVersionGE (130-135)
  • TargetHasSMVersionGE (130-130)
🔇 Additional comments (7)
src/transform/loop_vectorize.cc (7)

43-71: LGTM!

The IsBufferContiguous function correctly validates row-major contiguity by checking that strides follow the expected pattern: strides[n-1] = 1 and strides[i] = strides[i+1] * shape[i+1]. The use of analyzer->CanProveEqual for symbolic comparison is appropriate.


106-108: LGTM!

The constructor update cleanly adds optional LayoutMap support while maintaining backward compatibility through the default empty map.


152-171: LGTM!

Good refactor to use IsSharedBuffer/IsGlobalBuffer utility functions instead of inline scope string comparisons. This improves maintainability and consistency with other parts of the codebase.


238-272: LGTM with a minor note.

The layout map handling logic is correct:

  • Index forwarding through layout
  • Linear offset computation and decomposition when dimensions differ
  • Proper reversal of decomposed indices

The ICHECK on line 244 provides a clear assertion for the contiguity requirement, which is appropriate for catching developer errors early.


389-397: LGTM!

The GetVectorizeSize overloads correctly thread the LayoutMap parameter to VectorizePlanner, maintaining a clean and consistent API.


496-519: LGTM!

The VectorizeLoop overloads properly propagate the LayoutMap parameter, maintaining consistency with the GetVectorizeSize functions and backward compatibility.


212-219: LGTM!

Explicitly allowing bitwise operations to be vectorized is appropriate since these operations are inherently element-wise and can benefit from SIMD vectorization.


Comment on lines 182 to 207
} else if (node->op == tl::atomic_add_elem_op()) {
  // Assert at least 2 args (dst_ptr and src)
  ICHECK(node->args.size() >= 2)
      << "atomic_add_elem_op requires at least 2 args (dst and src)";

  // Get dst dtype from args[0] (address_of call containing BufferLoad)
  auto address_of_call = node->args[0].as<CallNode>();
  ICHECK(address_of_call && address_of_call->op == builtin::address_of())
      << "atomic_add_elem_op first arg must be address_of call";

  auto buffer_load = address_of_call->args[0].as<BufferLoadNode>();
  ICHECK(buffer_load) << "address_of arg must be BufferLoad";

  DataType dtype = buffer_load->buffer->dtype;
  int vectorize_length = 1;
  if (dtype.is_float16() || dtype.is_bfloat16()) {
    vectorize_length = 2;
  } else if (dtype.is_float() && dtype.bits() == 32 &&
             TargetHasSMVersionGE(Target::Current(false), 90)) {
    vectorize_length = 4;
  }

  vector_size_ = arith::ZeroAwareGCD(vector_size_, vectorize_length);
  // Do not visit the args of atomic_add_elem_op, because pointer type
  // is impossible to vectorize
  return arith::IRMutatorWithAnalyzer::VisitExpr_(node);

⚠️ Potential issue | 🟡 Minor

Comment contradicts behavior: args are still visited.

The comment on lines 205-206 states "Do not visit the args of atomic_add_elem_op, because pointer type is impossible to vectorize," but line 207 calls arith::IRMutatorWithAnalyzer::VisitExpr_(node) which will recursively visit all arguments.

If the intent is to skip visiting args (as done for "Other calls" on line 223), this should return early without calling the base class visitor:

🔧 Suggested fix to skip arg visitation
       vector_size_ = arith::ZeroAwareGCD(vector_size_, vectorize_length);
       // Do not visit the args of atomic_add_elem_op, because pointer type
       // is impossible to vectorize
-      return arith::IRMutatorWithAnalyzer::VisitExpr_(node);
+      return ffi::GetRef<PrimExpr>(node);

Alternatively, if visiting args is intentional, update the comment to reflect the actual behavior.


@github-actions

Performance Regression Test Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21036663931

Results

File Original Latency Current Latency Speedup
example_warp_specialize_gemm_barrierpipe_stage2 0.038657 0.040289 0.959493
example_warp_specialize_gemm_copy_0_gemm_1 0.037889 0.038912 0.97371
example_mha_bwd_bshd_wgmma_pipelined 0.025718 0.0262144 0.981063
example_warp_specialize_gemm_copy_1_gemm_0 0.038272 0.038944 0.982744
example_linear_attn_bwd 0.152379 0.154728 0.984817
example_topk 0.010977 0.011104 0.988563
example_tilelang_gemm_splitk 1.40748 1.42278 0.989244
example_linear_attn_fwd 0.0366075 0.0370045 0.989271
example_mha_sink_bwd_bhsd 0.0617174 0.0623242 0.990264
example_tilelang_gemm_splitk_vectorize_atomicadd 1.41022 1.42299 0.991027
example_mha_inference 0.079297 0.08 0.991213
example_convolution_autotune 0.993208 1.00002 0.993191
example_gqa_decode 0.048898 0.049025 0.997409
example_convolution 1.3324 1.33545 0.997715
example_blocksparse_gemm 0.0226922 0.0227353 0.998104
example_gqa_sink_bwd_bhsd_sliding_window 0.025584 0.0256217 0.998529
example_gemm_schedule 0.0325786 0.0326232 0.998633
example_gemv 0.288756 0.288923 0.999423
example_gqa_sink_bwd_bhsd 0.0416738 0.041697 0.999444
example_group_per_split_token_cast_to_fp8 0.0102831 0.0102871 0.999617
example_fusedmoe_tilelang 0.141381 0.141418 0.999733
example_mla_decode 0.461319 0.461416 0.99979
example_gqa_bwd 0.0498051 0.0498114 0.999873
block_sparse_attn_tilelang 0.0102955 0.0102963 0.999915
example_dequant_gemm_bf16_mxfp4_hopper 0.508842 0.508872 0.999941
example_tilelang_nsa_fwd 0.00701385 0.00701395 0.999986
example_gemm_intrinsics 0.035137 0.035137 1
example_tilelang_gemm_fp8_2xAcc 0.187282 0.187269 1.00007
example_elementwise_add 0.297707 0.297638 1.00023
example_dequant_gemm_w4a8 5.39917 5.39781 1.00025
fp8_lighting_indexer 0.0359907 0.0359758 1.00041
example_tilelang_nsa_decode 0.00676021 0.0067567 1.00052
example_dequant_gemv_fp16xint4 0.0284411 0.0284257 1.00054
tilelang_example_sparse_tensorcore 0.015094 0.0150855 1.00056
example_dynamic 0.657036 0.656652 1.00058
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0144752 0.0144617 1.00093
example_mha_bwd_bhsd 0.0402058 0.040151 1.00137
example_tilelang_gemm_fp8 0.321237 0.320564 1.0021
example_per_token_cast_to_fp8 0.00738372 0.00736796 1.00214
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0146208 0.0145891 1.00217
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0155006 0.0154631 1.00243
topk_selector 0.0541186 0.0539536 1.00306
example_dequant_gemm_bf16_fp4_hopper 0.575433 0.573611 1.00318
example_mha_bwd_bshd 0.0410689 0.0409056 1.00399
example_mha_sink_bwd_bhsd_sliding_window 0.0447355 0.044552 1.00412
example_gemm 0.02272 0.022625 1.0042
example_mha_fwd_varlen 0.045756 0.045514 1.00532
sparse_mla_fwd 0.131867 0.131149 1.00547
example_mha_sink_fwd_bhsd_sliding_window 0.0157815 0.0156809 1.00641
sparse_mla_bwd 0.386536 0.383996 1.00661
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0155719 0.01545 1.0079
example_gemm_autotune 0.022528 0.022337 1.00855
example_warp_specialize_gemm_softpipe_stage2 0.039008 0.038625 1.00992
sparse_mla_fwd_pipelined 0.0956784 0.0947169 1.01015
example_mha_sink_fwd_bhsd 0.0159417 0.0156215 1.0205
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.58445 3.49219 1.02642
example_dequant_gemm_fp4_hopper 1.07653 1.03189 1.04326
example_gqa_bwd_wgmma_pipelined 0.0745268 0.0698792 1.06651
example_gqa_bwd_tma_reduce_varlen 0.0637903 0.0523218 1.21919
example_tilelang_gemm_fp8_intrinsic 0.934136 0.467096 1.99988

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit b27fb92 into tile-ai:main Jan 15, 2026
2 of 3 checks passed